Stupid Lucene Tricks: Document Frequencies and NOT

Mark Leighton Fisher on 2010-07-29T16:13:30

  1. You can get the document frequency of a term (i.e. how many documents contain that term) through Lucene.Net.Index.IndexReader.DocFreq(t As Term) As Integer.

  2. You can get the IndexReader behind a Lucene.Net.Search.IndexSearcher through IndexSearcher.GetIndexReader().

  3. If you want to display the document frequencies for the individual keywords of a search, and one of the keywords is negated with NOT (like -antibiotic in the query antimicrobial -antibiotic), you cannot use DocFreq() directly, because it counts the documents that contain the term rather than the documents that lack it. In that case, the document frequency can be computed as:

          DOCFREQ = count of all documents - DocFreq(TERM_NO_NOT)

    as in:

          DOCFREQ = 60227 - DocFreq(New Term("all", "antibiotic"))

    where the NOT piece was -antibiotic, all is the Lucene document field in question, and 60227 is the total number of documents in the index.
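
Putting the three tips together, a minimal sketch might look like the following. It is an illustration under assumptions, not tested production code: the module and function names (DocFreqSketch, KeywordDocFreq) are hypothetical, it assumes a Lucene.NET 2.x-style API, it takes IndexReader.NumDocs() as the "count of all documents", and it assumes the keyword is already in the same form as the indexed terms (for example, lowercased by the analyzer).

```vbnet
Imports Lucene.Net.Index
Imports Lucene.Net.Search

Module DocFreqSketch

    ' Hypothetical helper: returns the document frequency to display for one
    ' search keyword. A leading "-" marks a NOT keyword such as "-antibiotic".
    Function KeywordDocFreq(ByVal searcher As IndexSearcher, _
                            ByVal field As String, _
                            ByVal keyword As String) As Integer
        ' Tip 2: get the IndexReader behind the searcher.
        Dim reader As IndexReader = searcher.GetIndexReader()

        ' Strip the NOT marker, if any, to get the bare term text.
        Dim negated As Boolean = keyword.StartsWith("-")
        Dim text As String = If(negated, keyword.Substring(1), keyword)

        ' Tip 1: number of documents that contain the term.
        Dim freq As Integer = reader.DocFreq(New Term(field, text))

        ' Tip 3: for a NOT keyword, report the complement.
        ' Assumption: NumDocs() is the right "count of all documents".
        If negated Then
            Return reader.NumDocs() - freq
        End If
        Return freq
    End Function

End Module
```

With the example above, KeywordDocFreq(searcher, "all", "-antibiotic") would compute 60227 - DocFreq(New Term("all", "antibiotic")) on an index of 60227 documents. One caveat worth checking against your own index: DocFreq() may still count deleted documents that have not yet been merged away, so the subtraction can be slightly off until the index is optimized.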

(Ob. Perl: Although PLucene is now 5 years out of date, Perlesque should eventually let you get at Lucene.NET via a strongly-typed Perl 6.)